SAW: an efficient and accurate data analysis workflow for Stereo-seq spatial transcriptomics

The basic analysis steps of spatial transcriptomics require obtaining gene expression information from both space and cells. The existing tools for these analyses incur performance issues when dealing with large datasets. These issues involve computationally intensive spatial localization, RNA genome alignment, and excessive memory usage in large chip scenarios. These problems affect the applicability and efficiency of the analysis. Here, a high-performance and accurate spatial transcriptomics data analysis workflow, called Stereo-seq Analysis Workflow (SAW), was developed for the Stereo-seq technology developed at BGI. SAW includes mRNA spatial position reconstruction, genome alignment, gene expression matrix generation, and clustering. The workflow outputs files in a universal format for subsequent personalized analysis. The execution time for the entire analysis is ∼148 min with 1 GB reads 1 × 1 cm chip test data, 1.8 times faster than with an unoptimized workflow.

In this step, CID and coordinates need to be recorded in memory for real-time query and spatial positioning of reads.Faced with large chips, such as S6 (6 cm × 6 cm), the spatial coordinate points can reach as many as 15 billion, and the data structure that stores the correspondence between the CID and the coordinates occupies a lot of memory.Moreover, querying in such a large table can be slow, especially when considering fault tolerance, as the computational complexity and time consumption can increase further.Finally, on large chips, matrix operations of the same size as the chip also have performance bottlenecks, such as excessive memory usage and slow speed.These problems need to be solved through high-performance computing technologies.
We developed the Stereo-Seq Analysis Workflow (SAW), a standard analysis process of Stereo-Seq data.Taking FASTQ [3,4] as inputs, SAW performs mRNA spatial location restoration, filtering, mRNA genome alignment, gene region annotation, molecule identity (MID) correction, expression matrix generation, tissue region extraction, clustering, saturation analysis, and report generation to obtain the gene expression and spatial information of tissues.Hence, SAW provides the complete basic analyses required for spatial transcriptomics data.

IMPLEMENTATION Processing and parallelization of spatial information in large chips
The principle of spatial localization of sequencing data by spatial transcriptomics is to mark the spatial position and sequencing reads with a 25 bp CID sequence.It then requires locating the sequencing reads back to their original spatial position by matching the CID sequence on the reads.However, as the DNA sequence obtained by current sequencers is not 100% accurate and has a certain error rate, a margin of error tolerance is required when matching the CID sequence.The current error tolerance strategy is to replace each base on the CID sequence with the other three bases (the gene sequence comprises four bases, A, G, C, and T) and then perform the matching.
Due to the high resolution and large field of view of Stereo-seq, the number of spatial coordinate points is very large.For example, for the S6 chip (6 cm × 6 cm), the number of spatial coordinate points can reach as many as 15 billion.Simply storing the corresponding relationship between each coordinate point and the CID sequence in a data structure can consume more than 600 GB of memory, and the query speed can be very slow, seriously affecting the applicability and analysis efficiency of standard analyses.Therefore, we split the spatial coordinates and CID information, which are stored in a mask file.Correspondingly, the FASTQ files are also split according to the same rules.For example, if the mask file needs to be split into four parts, the first base of the CID sequence can be used as the classification criterion and split into four parts starting with A, C, G, and T. If 16 parts are needed, the first two bases can be used as the classification criterion.If a non-power-of-4 number of parts is needed, such as ten parts, we can use a modulo operation.Similarly, the FASTQ file can be split according to the CID sequence using the same rule, and the corresponding mask and FASTQ files belonging to the same category need to undergo CID mapping.This step solves the above memory problem and improves the parallelization of data processing (Figure 1).The mask file (which records the corresponding relationship between the CID sequence and the spatial coordinates) and the FASTQ split are paired for subsequent analyses, and then merged when needed.

Rapid alignment of genomes
The successful positioning of spatial location in read goes through several filtering steps, including whether MID contains N bases, whether MID is poly-A, the MID quality value, and whether mRNA contains poly-A.Reads filtered through these steps undergo genome alignment and output a BAM [5] file containing alignment information.Currently, commonly used RNA alignment tools include STAR, Hisat2 [6]  TopHat2 [7] (RRID:SCR_013035).Among them, STAR is known for its high unique alignment rate and relatively fast speed, but it still cannot meet the requirements of sizeable spatial omics datasets.Therefore, we made a series of optimization attempts, including using efficient multi-threaded input-output (IO) models, single instruction multiple data, improving L2 cache hit rate and other micro-architectural optimization techniques, redesigning business-level algorithms in data processing, and using FM-index technology in maximum matching prefix search, ultimately accelerating it by two times.

Gene expression matrix generation
Gene expression quantification analysis of STOmics is achieved through the count tool in the analysis software.Count annotates uniquely mapped reads based on mapping alignment results, combined with the reference gene annotation file (GFF/GTF [8,9]) of the corresponding species, and then corrects and deduplicates MID, generating processed BAM files, gene annotation, and MID correction and deduplication.

Gene region annotation process
(1) For each read, search for overlap with the gene interval in the annotation file, calculate the gene name/strand on the annotation, determine whether it is EXONIC, INTRONIC, or INTERGENIC, and count how many belong to each type.
(2) Parse the cigar information.Obtain the entire length of the read and the starting position and length of each align block.
(3) Search for all overlapping genes.(i) First, obtain a list of genes with the best annotation results (priority is given to EXONIC>INTRONIC>INTERGENIC).
(ii) From these genes, select the gene with the most significant overlap as the annotation result.
(iii) If multiple genes have the same overlap length, randomly select one gene (the selection rule is to choose the gene with the smallest start and end).

MID correction
The following process corrects the error MID caused by sequencing errors based on the Hamming distance:  (2) Filter out reads with directions opposite to the annotated gene chain direction.
(3) Group by coordinate, gene, and MID in order.
(4) Count the number of unique MIDs for each coordinate and gene, which is the expression matrix.Calculate the fault tolerance count between "CCC" and "AAA" -"GGG", all of which are greater than the threshold of 1.
Calculate the fault tolerance count between "GGG" and "AAA" -"AAT", and find that when encountering "GGA", the fault tolerance count is 1.Then, update the count values of both and record the corresponding relationship before and after correction.
Calculate the fault tolerance count between "AAT" and "AAA" -"AGA", and find that when encountering "AAA", the fault tolerance count is 1.Then, update the count values of both and record the corresponding relationship before and after correction.
Calculate the fault tolerance count between "AGA" and "AAA" -"GGA", and find that when encountering "AAA", the fault tolerance count is 1.Then update the count values of both and record the corresponding relationship before and after correction.
(3) Clustering analysis.An unsupervised clustering is performed using the Leiden [13] algorithm, a graph-based clustering method.First, a neighborhood graph is constructed based on the similarity between cells, where each cell is considered a node and the connections between nodes represent their similarity.Each node is initially considered a separate cluster, and the modularity of the entire graph is calculated.Then, adjacent nodes are iteratively merged to improve modularity.In each iteration, the modularity of merging each node with its neighboring nodes is calculated, and the merging method with the highest modularity is selected.When the iterative process converges, the final clustering result is reached.

Preparation
The formula for calculating the saturation value is 1-(uniq reads/total reads).First, 5% of bin200 unique coordinates are sampled and restored to bin1 coordinates.Then, the sampled bin1 coordinates under tissue are used to filter data, a list of (x, y, gene, MID) is constructed, and all count values to obtain anno reads are accumulated.

Saturation calculation
Shuffle the previous list, process the data in order according to the sampling interval of {0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, and, for each sampling point, calculate the total reads/saturation value/median number of genes under bin1/bin200, respectively, and output the statistical result file.
Saturation value.Calculate the uniq reads (i.e., gene and MID are both uniq) and the total reads under all bins using the formula 1-(uniq reads/total reads).The algorithms for bin1 and bin200 are the same.
Median number of genes.Calculate the number of uniq genes under each bin and take the median.The algorithms for bin1 and bin200 are the same.
Uniq reads.The uniq reads of bin1 are all the uniq reads (i.e., gene and MID are both uniq) under all bin1; the uniq reads of bin200 are all the uniq reads under all bin200 (x, y coordinates of bin1, gene, and MID are all uniq).

Memory issues in large chip scenarios
In large chip scenarios, in addition to the mask file, storing gene expression information in large IO files and matrix calculations can cause excessive memory consumption.To solve this problem, we used a series of optimization techniques, including batch processing of large IO files and partial matrix calculations, pre-calculating sizes to avoid using dynamically expanding data structures, and designing more finely-tuned custom data-types with smaller memory overhead based on business characteristics.These techniques enabled large chip data to be successfully completed on ordinary memory machines (256 GB).

EXAMPLES mRNA spatial position restoration, filtering, and genome alignment statistics
Taking SS200000135TLD1 data (https://github.com/BGIResearch/SAW/tree/main/testdata)as an example, we executed mRNA spatial position restoration, filtering, and genome

Gene expression spatial distribution map
The BAM file generated after genome alignment was annotated and MID-corrected to produce expression information and statistical results.The expression information was stored in hdf5 format and could be visualized (Figure 3).The statistical results provided the number of exonic regions, introns, and intergenic regions annotated.A total of 481 MB reads were annotated in the exonic region (Table 1).

Spatial clustering
Gene expression profiles were calculated for each position within bin200 (a 200 × 200 grid of points), then spatial clustering was performed (Figure 4).This resulted in 21 classifications, which roughly aligned with the cell clustering results.

Saturation analysis
Our saturation analysis showed that the median sequencing depth and number of gene species tended to saturate, while the number of unique reads did not yet saturate (Figure 5).

Testing
Each pipeline tool was optimized through a series of high-performance computing techniques.We conducted performance tests on three samples to evaluate changes in runtime and memory.Data1 and data2 are both s1 chips (1 cm × 1 cm) with around 1 billion reads, while data3 is a large chip (2 cm × 3 cm).After optimization, the runtime on data1 decreased from 263.1 to 148.1 min, resulting in a 1.8× speed increase, and the mapping time decreased from 175.0 to 106.9 min.On data2, the runtime decreased from 227.9 to 127.3 min, resulting in a 1.8× speed increase, and the time of mapping decreased from 158.0 to 83.7 min (Table 2).In terms of memory optimization, after process optimization, the memory peak of tissueCut on data3 decreased from well over 83.5 GB to 33.5 GB.

FUTURE DIRECTIONS
The alignment rate of CID affects the amount of data that enters the subsequent analysis.

( 4 )
Determine which gene to choose.For each gene: (a) For each align block, calculate with each transcript of the current gene, obtain multiple exoncnt/introncnt based on the length of overlap with exon/non-exon regions.(b) Accumulate the cnts of each align block to obtain the optimal exoncnt/introncnt.If the exoncnt is greater than or equal to 50% of the read length, mark it as EXONIC.Otherwise, if the introncnt is greater than or equal to 50% of the read length, mark it as INTRONIC.Otherwise, mark it as INTERGENIC.(c)Choose the most reliable gene from multiple genes.

( 1 )
Data preparation.A nested map of the form {cidgene: {mid: cnt}} stores the number of each CID and gene combination under various MIDs.

( 2 )
Correction.(a) Set parameters: minimum number of MID types threshold/tolerance number threshold/MID length.Default: 5/1/10.(b) Correction.For each group of data in the nested map: (i) Check the number of MID types and continue processing if it is greater than or equal to the threshold.(ii) Sort in descending order according to the cnt of MID, obtaining a list in the form of [(mid1, cnt1), (mid2, cnt2), …].(iii) Traverse the sorted list in reverse order, starting with the MID with the smallest cnt, and calculate the base error with other MIDs.If it satisfies the tolerance number threshold, correct the current MID to the MID with a larger cnt, and transfer the cnt of the current MID to the correct MID.(iv) Obtain the nested map after correction in the form of {cidgene: {oldmid: newmid}}.Expression matrix (1) Select reads annotated to EXON or INTRON.

( 3 )
ExampleGiven a fault tolerance threshold of 1, assuming the sorted MID sequence and count are as follows:

Figure 2 .
Figure 2. Summary of spatial position restoration, filtering, and genome alignment.

Figure 3 .
Figure 3.A demo of the spatial visualization of gene expression information.
With our test sample, we obtained a successful alignment rate of 78.8% reads.Due to sequencing errors and alignment algorithm limitations, approximately 20% of reads could not be aligned.In the future, more accurate algorithms (such as those that consider mask CID and fastq CID-mismatch base quality-values and spatial features) or deep learning models may further improve the accuracy of the pipeline.

Figure 4 .
Figure 4.A demo of the spatial clustering for a mouse brain dataset.

Figure 5 .
Figure 5.A demo of reads saturation statistics.

Table 1 .
Demo of the reads annotation statistics.

Table 2 .
The elapsed time used by the analysis workflow before and after optimization.